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Abstract 

We present an approach for penalized tensor decomposition (PTD) that 
estimates smoothly varying latent factors in multi-way data. This gener¬ 
alizes existing work on sparse tensor decomposition and penalized matrix 
decompositions, in a manner parallel to the generalized lasso for regres¬ 
sion and smoothing problems. Our approach presents many nontrivial chal¬ 
lenges at the intersection of modeling and computation, which are studied 
in detail. An efficient coordinate-wise optimization algorithm for (PTD) is 
presented, and its convergence properties are characterized. The method 
is applied both to simulated data and real data on flu hospitalizations in 
Texas. These results show that our penalized tensor decomposition can of¬ 
fer major improvements on existing methods for analyzing multi-way data 
that exhibit smooth spatial or temporal features. 

Key words: multiway data, tensors, trend filtering, penalized methods, con¬ 
vex optimization 
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1 Introduction 


1.1 Structure and sparsity in multiway arrays 


In recent years there has been an increasing interest in the use of penalized meth¬ 
ods for matrix and tensor decompositions. As in classical principal-components 
analysis (PCA), the goal of these methods is to represent a high-dimensional 
data matrix or multiway array in terms of a lower-dimensional set of latent fac¬ 
tors. This line of work differs from classical techniques, however, in the use 
of penalty functions that encourage these estimated factors to be sparse, struc¬ 
tured, or both. As many previous authors have demonstrated, such regularized 
estimators usually exhibit a favorable bias-variance tradeoff, particularly when 
the size of the array far exceeds the number of samples. They can also make the 
estimated factors themselves much more interpretable to practitioners. 

Existing methods for penalized matrix decompositions have been shown to 
outperform classical PCA in discovering patterns in application areas such as ge¬ 
nomics and neuroscience. Penalties that encourage structure (such as the fused 
lasso) provide tnterpretable results when there is a natural order of the measure¬ 
ments, w hile penalties that e ncourage sparsity are useful when there is no such 
ordering ( Witten et al. . 2009|) . In the high-dimensional tensor setting however, 
existing decomposition methods only enforce sparse constraints. We address 
this gap by proposing a method for penalized tensor decomposition (PTD) that 
allows arbitrary combinations of sparse or structured penalties along different 
margins of a data array. 

Given a data array V = [Yus], the statistical problem that we study is to find a 
low-dimensional factor representation (also known as a Parafac decomposition) 
such that the factors are constrained to be sparse and/or smooth. For ease of 
presentation, we restrict attention to the three-way case, but the generalization 
of our approach to arrays with more than three modes is straightforward. 

More explicitly, suppose we are given a set of observations yi^t,s, the elements 
of a three dimensional tensor Y_ G that have been generated from the 

complete tensor model 


j 

yi,t,s = ^d*u^jOv;jowlj + ei^t,s, , I e t e s e 

j=i 

( 1 ) 

with unknown hidden vectors u* G R^, v* G R^, tc* G R"^, j = 1,..., J and 
scalars d*, j = 1,..., J. We will later discuss the missing data problem. For 
simplicity we assume that the variance cr^ of the error term ei^t,s is known and 
equal to 1. Moreover, when J = 1 we suppress the index j. Our goal is to 
estimate these latent factors, which can be challenging since we only have one 


2 









observation for each combination u*ip v^j, w*j. However, we assume that this 
task is aided by the presence of special structure in these true vectors. Explicitly, 
we assume that some of the vectors are restrictions 

of smooth functions defined in the interval [0,1]. For instance, it might be the 
case that = u*{l/L) for I = 1,..., L, where u* is a piecewise continuous or 
differentiable function on [0,1]. 

A natural situation in which this would arise is when one of the modes of the 
data array corresponds to a temporal or spatial axis. Our main contribution is to 
provide optimization algorithms for finding Parafac decompositions that shrink 
towards such structure. To do so, we apply a generalized lasso penalty along 
each mode of the array. We refer to this class of methods as penalized tensor 
decompositions (PTD). 

We face two main challenges in estimating the factors. First, the resulting op¬ 
timization problem is non-con yex. We prop ose to reach a stationary point using 
block coordinate descent, as in AllenI ( 2012h , and we provide convergence rates 
for a single-block udpate. This leads us to t he second ch allenge: unlike in the 
sparse unconstrained problem formulated by I AllenI (|2012h . for our case of a gen¬ 
eralized lasso penalty, it is not clear how to make the block-coordinate updates. 
Our results provide a novel way of doing so that exploits the multi-convex struc¬ 
ture of the problem, and that provides efficient algorithms for finding the factors 
when formulating the problem either in a penalized or constrained form. 


1.2 Relation to previous work 

Structurally constrained estimation is an active area of research, and we do not 
attempt a comprehensive review. Our work draws heavily in the one dimen¬ 
sional case on advances in understanding the one dimensional case, where pe¬ 


nalized regression has been widely studied in the literature ((Friedman et al. 


20101: Kim et al. . 20091: Tibshirani . 19961: Tibshirani et al. . 2005 ). For instance, in 
protein mass spectroscopy and gene expression data measured from a microar- 


ray, t he fused lasso has been used to obtain interpretable results ((Tibshirani et al. 
2 OO 5 I) . The fused lasso is a natural choice here, since it encourages neighbor¬ 


ing measurements to share the same underlying parameter. Similarly, to enfo rce 
smoothness in the solution, trend filtering has been proposed Kim et aP (I2OO9I) as 
a way to place one-dimensional function estimation within the convex optimiza¬ 
tion framework. The trend filtering penalized-regression problem has found ap¬ 
plications in areas as diverse as image processing and demography. 

In the case of matrix decomposition, the need for penalized methods arises in 
applications in genetic data, where there are multiple comparative genomic hy¬ 
bridizations and we expect correlation among observations at genetic loci tha t 
are close to each other along the chromosome. As shown in Witten et al. ( 20091) , 
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by considering different choices of penalties, we can recover different kinds of 
structures along either the ro ws or the columns of a data matrix. See the ref¬ 


erences m 


Witten et al.l (|2009ll for a much more comprehensive bibliography on 


sparse principal components analysis. 

In moving from matrices to multiway arrays, Parafac decompositions of¬ 
fer an attractive framework for recovering latent lower dimensional structure. 
This is due to their easy interpretability as well as feasibili t y of computation 


( Anandkumar et al.l 2014 : Harshman . 19701; Karatzoglou et^, 20101: Kolda and Bader , 


2 OO 9 I: Kroonenberg . 2008 ). More generally. Tucker models have been proposed as 
gener al models for m ultiway data and have been successfully applied in many 
areas (|Cichockil. l2013h . Other popular method s tor tensor decompositions in - 
clude those described in Bhaskara et al. ( 2014 ) and De Lathauwer et~^ ( 2000 ). 
However, these ap proach e s do not provide structural or sparse solutions. This 
point was made by Allen ( 2012|) . who proposed a sparse penalized Parafac de¬ 
composition method that outperforms the classi cal Parafac decom position when 
the true solutions are sparse. More recently, ISun et all (|2015l) also considers 
sparse tensor recovery and provides statistical guarantees for such a task. 

In this paper, we study methods for structured, as opposed to sparse, tensor 
factorizations. Ou r approach is insp ired by the penalized matrix decomposi¬ 
tion methods from lWitten et al.l (|2009l) . We generalize the matrix-decomposition 
problem to the framework of tensor Parafac decompositions while incorporating 
solution algorithms for a more broad class of penalties, including trend filtering 
for factors that are smooth (e.g in space or time). 


1.3 Basic definitions 

We now introduce notati on and definiti o ns us ed throughout the paper. This 
material can be found in ICichocki et al.l (|2009l) , to which we refer the reader 
for more details. Let I1J2—, In-, denote index N upper bounds. A tensor Y_ 
E K^ix-fax.-.x/jv Qf order N is an A— way array where elements are in¬ 

dexed by in e{l,2,....,Jn}, for n= 1, ...,A. Tensors are denoted by capital letters 
with a bar, e.g. Y_ E ]Rrixr2x...x/jv Matrices are denoted by capital letters, e.g Y, 
and for a matrix Y we denote by Y~ its generalized inverse. Vectors are de¬ 
noted by lower case letters, e.g y. The outer product of two vectors a G and 
6 G yields a rank-one matrix A = a o b = ab^ E and the outer prod¬ 


uct of three vectors a E W, b E W and c G yields a third-order rank-one 
tensor A = a o b o c E . We use || ■ ||f to indicate the usual Frobenius norm 

of tensors. The mode-n multiplication of a tensor Y_ E ]R^ix^2x...x/jv \yy ^ vector a 
E is denoted by Z := Y_ Xn a E ]R^ix - x/„_ix7„+ix...x/jv^ element-wise we 
have z. 
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1.4 Outline 


The rest of the paper is organized as follows. Section 2 defines our statistical 
approach to rank-1 tensor decompositions based on generalized lasso penalties. 
Section 3.1 then provides solution algorithms for our problem formulation when 
the penalties are used to define a set of constraints on the parameters. This is 
done by exploiting the efficiency of solution-path algorithms for generalized- 
lasso regression problems. In Section 3.2, we then study an unconstrained ver¬ 
sion of the problem where the penalties enter directly into the objective. Because 
the original problem is not convex, this is not equivalent to the constrained for¬ 
mulation, and some important algorithmic differences are highlighted. After 
developing algorithms for rank-1 tensor decompositions. Section 3 concludes by 
extending these ideas to the general case of multiple factors. 

Section 4 presents a convergence analysis for our fundamental rank-1 decom¬ 
position algorithm. In Section 5, using simulated data, we benchmark against 
state-of-the-art methods on rank-1 and multiple factor decompositions, measur¬ 
ing the error of recovery with the Frobenius norm. We then validate our algo¬ 
rithms on two real data sets involving flu hospitalizations in Texas and motion- 
capture data. Finally, Section 6 present a brief discussion of the overall frame¬ 
work proposed in this paper. 


2 Penalized tensor decompositions 


We fir st consider the case J = 1. Taking a point of view similar to IWitten et al.l 
(|2009ll . for positive constants c„, and Cw, we formulate the following problem: 


minimize \\Y — guovow\\^p 

neK-^,DeR^,iiieK®,s'eR 

subject to ||Z1 “m||i < c„, < Cw 

u^u = 1 , v^v = 1 , w'^w = 1 , 


( 2 ) 


where D'^ and D'^ are matrices which are designed to enforce structural con¬ 
straints. When the context is clear we will suppress the superscript and simply 
use the notation D. We note that an alternative, although non-equivalent, for¬ 
mulation is based on an unconstrained version of (|2]) given as 

minimize \\Y — guov o w\\% + A„ ||iA“M||i -t- ll-D^vHi -|- A„, ||iA"'w||i, 

u^u = 1 , v'^v = 1 , w'^w = 1 

(3) 

with the same unit-norm constraints on the factors. In Section 3, we will discuss 
the computational differences between these formulations in detail. 

We now briefly discuss a broad class of penalties of potential interest to prac- 
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titioners. We focus on choices that penalize first- and higher-order differences 
in eac h factor, which co rrespond to the fused lasso and trend filte ring, respec¬ 
tively ( Tibshirani . 2011 ). The fused lasso penalty was suggested in Witten et alJ 
(|2009 1 to detect regions of gain for sets of genes in matrix-decomposition prob¬ 
lems. For this penalty, the associated D matrix is the (S' — 1) x S' first-difference 
matrix, Dij = 1 if 7 = i, Djj = —lifj = i + 1 and Di j = 0 otherwise. As 
discussed in Tibshirani ( 2011 1. this penalty gives a piecewise-constant solution 
to linear-regression problems, and it is used in settings where the coordinates 
in the true model are closely related to their neig hbors. A related choice for D 
is oriented incidence matrices of a graph; see, e.g. lAmold and Tibshirani) (1201511 . 
These are constructed as generalizations of the 1-dimensional fused lasso on an 
underlying graph G. 

Still other choices for D correspond to polynomial trend filtering, which im¬ 
pose a piecewise polynomial structure on the underlying object of interest. These 
are constructed as follows. First define the polynomial trend filtering of order 1 
as Dtf i G where Dtf i = and G j^g pj.g|. qj._ 

der difference matrix. Then, recursively construct the polynomial trend filtering 
matrix of order k as Dtj^k = Di,d ■ Dtf,k-i- 

The polynomial trend filtering fits (especially for k = 3) are similar to those 
that one could obtain using regression splines and smoothing splines. However, 
the knots (changes in kth derivative) in trend filtering are selected adaptively 
based on the data, jointly with the inter-knot polynomial estimation (jTibshiranil. 
201 ih. A comp r ehens ive study of polynomial trend filtering can be fo und in 


Tibshirani et al. ( 2014ll . We note that Problem (O was already studied in Allen 
(|2012h for the case in which all the matrices V" and H"' are set to be the 
id entity. This is the case of having the LI penalty on each mode. The authors 
in lAllenI (|2012[) proposed a fast algorithm to solve the problem. However, the 


LI penalty has the disadvantage of encouraging only sparsity. If the true factors 
are not sparse but instead locally flat or smooth, then having spars e constraints 
on the factors performs poorly. This phenomenon was observed in Witten et al.l 
(|2009ll in the context of matrix decompositions, where the fused lasso penalty 
was shown to properly recover flat vectors in the factors of the decomposition 
when the LI penalty failed to do so. We will extend these ideas to tensor decom¬ 
positions, applying penalties from the generalized lasso class. We now turn to 
the question of how to fit these models efficiently. 
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3 Solution algorithms 


3.1 Constrained problem 

Since ((2) is a non-convex problem, we propose to consider a block coordinate- 
descent routine. However, in order to have convex block-coordinates-updates, 
we instead state the following problem: 


maximize 


'^XiU X2V 


subject to ||H“m||i < c„, ||H''r;||i < c^, ||H"'tc||i < 


(4) 


vFu < 1 , 


v^v < 1 , 


T 

W W 


< 1 . 


This differs from (|2 in two ways. First, the objective has been reformulated in 
a more c onvenient way, but it is easy to show that this results in an equivalent 
problem (jKolda and Bader , 2009h . Secondly, the unit norm constraints have been 
relaxed to the convex constraints that each factor fall into the unit ball. Addition¬ 


ally, following IWitten et al.l ([20091), a simple modification can naturally handle 
missing data. Denoting by M the set missing observations, we solve the missing 
data problem by replacing the objective function in dlj with the function 


F{u,v,w)= ^ Yi^t,sUiVtWs 


(5) 


Note that dU has a multilinear objective function in u, v, and w. Since the 
penalties induced by D“, D'^ and are convex, we can use coordinate-wise 
optimization in order to solve this problem. For example, when v and w are 
fixed, the update for u is found by solving the following problem: 


T 

maximize {}^X 2 V x^w) u subject to 

'll 


\\u 


|2 < 1 
I2 — ) 


||-D“m||i < C„. (6) 


It would seem that a solution to dU) would not in general have unit norm. But 
it is possible to ensure that this will be the case—that is, to ensure the solution 
follows on the boundary of the constraint set—as long as Cu is chosen properly 
based on t he KKT conditions. A similar phenomenon was observed for the ma¬ 
trix case in Witten et al. (t2009 ). One of our results is that the solution to dE]) will 
very often turn out to have unit norm, despite our convex relaxation. A rigorous 
statement of this result will be given later. 

Our strategy to solve dU is to sweep through the vectors iteratively by pro¬ 
ceeding with block coordinates updates. Thus starting from initials and 

we proceed by solving, at iteration m, the problems shown in Algorithm [T1 It 
should be pointed out here that the best we can hope with Algorithm 1 is to ob- 
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= arg min |(—y X2 ^ x^w'^ u subject to \\u\\l <1, ||-D“m||i < c„.| 
= arg min |(—y Xi X3 V subject to ||u||2 <1, H-D^wHi < c„-| 

w'^ = arg min |(—y Xi X2 w subject to ||w||2 <1, ||-D'^w||i < c^.| 


Algorithm 1: Constrained problem block coordinate descent 


tain a local minimum to dU). It will be shown latter with our experiments that this 
local minimum provides interpretable and accurate estimators. Note that while 
the algorithm is structurally quite simple, the individual block-coordinate up¬ 
dates are non-trivial to solve efficiently. The remainder of this section discusses 
how this can be done. 

Given the symmetry of the problem, without loss of generalify, we focus on 
the update for u. We notice fhaf fhe consfrainf sef involves a non-differenfiable 
function, implying thaf if is nof possible fo use a gradienf-based mefhod. Before 
describing our approach, we firsf discuss two natural possibilities and explain 
why they were ultimately rejected. 

First, a simple approach is to include a slack variable z = D'^u and use the 
ADMM algorithm. However, the resulting update for u would require solving a 
constrained problem using, for example, an inferior-point method. This rapidly 
becomes infeasible, since it requires solving a large dense linear system. 


A second natural approach is to use the novel ADMM algorithm from IZhu 


(|2015ll fo solve each of the block-coordinate updates. For instance, the update for 
u would involve solving the problem 


u = arg mm 

U 

subject to 


-y xs-c™"^ X3 w'" 




u 


\u 


I 2 <1 


11 < c„, 2 : = {Eu — = X, 

_ (7) 

where is a matrix such that E^ y Then proceeding as in iZhul feoish . 

we observe that (fT7|l can be solved in linear time, as the update for m is a simple 
projection on the unit ^2 b all, while the update for x requires projecting in a £1 ball 
with the algorithm from iDuchi ef al.l (|2008|). (The actual updates for our prob¬ 
lem are given in the appendix.) However, while this algorithm indeed solves 
the constrained-problem updates, we find in that practice the ADMM routine 
requires a long time to converge. In particular, it presents problems enforcing 


the constraint that \\D^u^' 


Ii < 


so that the solution returned after reasonable 


runtimes is actually quite far from fhe feasible region. 

This motivates us to consider a different approach to solve the block-coordinate 
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updates in ([T]). We appeal to the following theorem, which suggests a simple 
method and also implies that, typically, the solution lies on the boundary of the 
unit ball. That is, it satisfies the non-convex constraint of problem (|2]), despite 
our relaxation. 


Theorem 1. Assume that Cu > 0 and F X2f Xsw ^ Range ((id“)^). Then the solution 
to (HI) is given by 

* _ {-Yx2vy<3W-{D^fyx*) 

II - F X 2 n X 3 w - (T)“)^7 a«||2 

where 


7a = argmin -|| - F X2 n X3 w - (T)“)'^7||^ 

ll7l|oc<A ^ 

A* = argmin [II-F X2 n X3 w - (T)“)^7 a|| 2 + Ac„] . 

0<A 


(9) 

( 10 ) 


As a direct consequence of the proof of Theo rem [H we can sol ve dU) by first 
solving (ID) with the solution-path algorithm from Tibshira^ (|201lh , then finding 
A* and finally u*. The explicit algorithm is given in the appendix. 

Unfortunately, there is no characterization available of of the computational 
time to compute the solution path. It is only known the cost at each iteration 
is 0{L) in its worst case, but it is unknown how many kinks K that a particular 
problem will have. Moreover, we notice that after the solution path is computed, 
the next two steps require 0{KL) cost. Therefore, the total cost for updating u is 
0{KL). 


3.2 Unconstrained version 

The framework we have introduced for rank-1 approximations has some nice 
features. In particular, the choice of tuning parameters is more intuitive, since 
this directly imposes a constraint on the smoothness of the solutions. However, 
the optimization routine derived from Theorem [T] is computationally intensive. 
In particular, for large dimensions of the penalty matrices, computing the en¬ 
tire solution path can still be somewhat slow. To avoid this, we revisit (O and 
consider a problem equivalent to its convex relaxation: 

minimize — F Xi m X 2 f X 3 w + A„ ||H“m||i + A^ ||H’'t>||i + A,„ ||Zl"'r(;||i 

subject to u^u < 1 , v'^v < 1 , w'^w < 1 . 

( 11 ) 
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As in the constrained case, we solve (fTTIl via block-coordinate updates. Now the 
update for u is obtained by solving 

minimizing — (H X 2 v X 3 subject to ||m ||2 < 1 . ( 12 ) 

U 


The solution to (|T^ can be characterized in the same maimer as for the con¬ 
strained case. In fact, the proof of Theorem [T] implies the following corollary: 

Corollary 2. With the notation and assumptions from Theorem ([T])/ the solution to 

minimize — (y^X 2 V x^w)^ u +X\\D'^u\\i subject to ||m ||2 < 1 (13) 

has the following form, where f\ is defined in @; 

(-(1:x2i;X3I/;)^-(/1T7a) 

Vj* = —. (14) 

II - (F X 2 V X 3 w)^ - 

An interesting consequence of the closed-form formula (ITU) , and the proof of 
Theorem 111, is that we can solve (IT^ by first solving a generalized lasso problem 
and then projecting the solution into the unit ball. Explicitly, we first find 

-u = arg min {||m — y X 2 v X 3 WII 2 -I- A ||Z1“m||i, , subject to ||m ||2 < 1} (15) 


and m/||m ||2 becomes the solution to (HU). Therefore, for trend-filtering problems, 
we can solve the regres s ion p roblem step with the fast ADMM algorithm from 


Ramdas and Tibshiranil (|2015t) . Moreover, f or the case of a fused lasso penalty. 


the update for u can be done in linear time (|Tohnsonl. 1201311 . Because these two 
algorithms are so efficient, the penalized formulation from (ITT1) can be solved 
much more cheaply than the constrained formulation from H]). 


3.3 A toy example 

We illustrate the advantage of problem (fTTIl over the formulation from ((4]| us¬ 
ing a toy example. We consider u* E and w* E ^400 

as the size of v* 

varies. Here, u* and w* are as in Structure 2 in Figure flf, while v* is the func¬ 
tion cos(9 7 ri) evaluated at evenly spaced locations in [0,1]. Taking initial val¬ 
ues from the power method, we compare the solutions from one iteration of 
the unconstrained formulation when choosing the penalty parameters adap¬ 
tively, versus an "oracle" version of the constrained problem with (c„, cfr = 
(||Z1“m*||i, ||iA’'i;*||i, ||Zl"’w*||i). This choice of hyperparameters for the constrained 
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Figure 1: Panel (a): Frobenius error comparison of the of three different methods 
for finding a rank-1 dec omposition. These are: Algorithm [T] with the ADMM 
method from izhul ( 2015h , block coordinate descent for solving the unconstrained 
problem (fTT]) . and Algorithm [T] using the solution path method as described in 
Section 3.1. Panel (b): For each of the methods, time in seconds for solving 
one problem with a particular choice of tuning parameters. Our unconstrained 
formulation with adaptive chosen penalties achieves nearly the reconstruction 
error of the unconstrained formulation with optimal hyperparameter choice, but 
at far less computational cost. 


problem is obviously optimal, but requires knowledge of the true factors, and is 
therefore unrealistic in practice. 

Figure [T] demonstrates the favorable trade-off ottered by the unconstrained 
formulation with adaptively chosen tuning parameters. We observe that while 
the constrained formulation algorithm based on the solution-path computation 
is the most accurate, the unconstrained formulation is competitive in terms of 
reconstruc tion error, and much more efficient. The ADMM algorithm based on 
Zhul (|2015|l is substantially less accurate than the other two methods. 

Moreover, in practice it would be necessary to solve the constrained prob¬ 
lem with more than one value of the tuning parameters, since we do not know 
||D“m*||i. Hence the penalized version is strongly preferred: we can do adaptive 
parameter choice more cheaply than solving the constrained version for a sin¬ 
gle hyperparameter setting, without a major loss of performance even under an 
optimal hyperparameter choice. 

With regards to the choice of regularization, we can consider two altern atives 
based on cross validation. The first of these follows IWitten et al.l (|2009|) . This 
procedure involves randomly deleting a percentage of the input data and solves 
the problem on the resulting tensor. The estimated tensor produces predicted 
values on the deleted entries, allowing one to compute mean square error of 
prediction for these notionally missing values. The parameters A„, Ay and 
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are then chosen to minimize the prediction error. This is particularly attractive 
when multiple processors are available, given that independent problems with 
different tuning parameters can be solved in parallel. 

The other alternative for cross validation applies to (ITT1) and it is based on 
adaptively choosing the tuning parameters. Thus, before estimating each vec¬ 
tor (say u), we obtain a generalized lasso regression problem and hence we can 
choose Xu by cross validation. We randomly separate the coordinates of the re¬ 
sponse vector into training and test set, solving the problem in the training set 
and computing the mean squared error of the predicted solution on the test set. 
This exploits the fact that m is a smooth function, and therefore given a solution 
based on the training set, we can provide estimates at the locations in the test set 
by interpolation. 


3.4 Multiple factors 

In the case of multiple factors, the main difference of t he tensor case ver s us th e 
matrix case is that we must find all the factors jointly (jKolda and Baderi. 120090 . 
as opposed to estimating factor k + I using the residual from the fitted fc-factor 
model. Fortunately, it is straightforward to use any of the algorithms in the 
previous section to handle multiple factors. Hence, to estimate the factors in ([T]), 
we state the problem 


j ./ 

minimize ||F — ^ dj Uj o Vj o Wj\\l + ^ [Xuj\\DJ Uj\\i + Vj\\i + Xu}j\\DJ Wj 

Uj ,Vj ,Wj 

J=1 J=1 

subjectto ||wll 2 ^ 1 |IhII 2 ^ 1 llwjlls < 1 j = 

(16) 

where the matrices DJ,Dj and DJ are chosen to capture different structural fea¬ 
tures desired for the solutions. Here, Xuj, Xyj and Xu,j are tuning parameters. 
Now we solve (IT^ by starting with initial guesses {u^}, {v^}, {w^}, {d^} and ap¬ 
plying the iterative updates listed in Algorithmic] exploiting results from Section 
3.2. 

In practice the number of latent factors can be chosen with an ad-hoc rule by 
looking at the proportion of the variance explained (as with a scree plot in ordi¬ 
nary PCA). One can look at the solutions provided by different values of J. The 
choice of J then corresponds to the number factors such that the increase in vari¬ 
ance explained obtained by solving the problem with more factors is negligible. 

We illustrate this in our real data example. 

Finally, in situations where the number of factors is large, the number of pos¬ 
sible combinations of tuning parameters becomes challenging. One possibility 
to address this is to choose the parameters adaptively as discussed in Section 
3.2. Hence, every time a factor is to be updated we select the parameter from a 
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Loop for jo = 1 : J, 




arg min 

lblli<i 

u-} 

vJo 


arg min 

Ibll2<i 

f — 1 

yjio 


arg min 

w — 



Y xi X 2 




(yio; 


[wY'^ 

2 


j¥^jo 

d^ 



{w^°Y '^j 

2 

^ T ^vjo 



d^ 




2 

^ T XujJq 



Y_Xi u^° X2 X3 w^° — {u^°Y (y^°Y' wK 


End loop 


Algorithm 2: Multiple factors 


small grid of values. This ensures that, tor instance, when dealing with fused 
lasso penalties each block coordinated update can be done in linear time. On 
the other hand, a different alternative is to use the same penalty parameter tor 
all the vectors corresponding to the same level of smoothness. For instance, one 
can use A„, j = Xu,i it = D^\ This reduces the burden of cross-validation. 


4 Convergence analysis 


We now examine the convergence of the block-coordinate algorithms developed 
in the previous section. Here, we assume that J = 1 in Model ([T]). In this case we 
recall that the underlying true tensor can be decomposed as the outer product 
of vectors u* G v* G and w* G times a constant d*. Moreover, we 
assume that the matrices D are chosen to be either fused lasso or trend filtering 
penalties. Thus, G D"" = G and D'^ = 


Our proof is inspired b y the work on cony ergence rates for generalized lasso 
regression problems from Wang et al. ( 2ni4h . The theorem states that, when 
starting with good initials, it is necessary to sweep through the data only once. 
The proof of the claim is based on the identity 


F{AnBnC)= P(A)P(5 I A)P(C' \AnB) 


for any events A, B and C. A related statement can be made in the case of 
multiple factors for a single update depending on the other factors. See the result 
in the appendix; the main difference there involves an error measurement that 
depends on the factors taken as fixed. 
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Theorem 3. Let {m^, v^,w^} denote a one-step update from Algorithm^ based on ini¬ 
tial values {u^,v^,w^}, and assume that \\D^u*\\i < Cjjj, < Cjjj,and\\D'^w*\\i < 

Cyj. Then, there exists a constant c > 0 such that ift > 0 satisfies 

f ct ct ct ^ 1 

^ d* ’ d*\/f ^ d* ’ d*^fS ^ d* J “ 2^’ 

and 


then 


< 2-^/f 


\w 


w 


< 2-^/f 


p(||w' - U*\\l < 16 , ||^;1 - ^;*||2 < 16 


( ct I 2c^ r*^^+i/2 

\d*VT ^ d* 


1 


where 


x) 



21/2 


X^/^a/ 5 TT log(x) 


Theorem |3] states that with good initials our rank-1 decomposition algorithm 
will be very close to the true factors under weak assumptions concerning the 
smoothness of the true factors. Thus, in practice before rurming our algorithms, 
we can consider a simple initialization that consists od solving Algorithm ([T]) for 
the case where the matrices D"^, Df and D"^ are all zero. This is known as the 
power method ( Kolda and Bader . 2009h . Moreover, statistical guarantees for a 
very related method to this procedure were studied in lAnandkumar et al. ( 2014 1. 

Finally, it should be note that Theorem|3]implicitly suggests that an appropri¬ 
ate choice of tuning parameter is (c„, c„, c^) = (||Z1“m||i, ||Z1^v|1i, ||Z1"'w||i) which 
only involves the true latent vectors. In the case of the unconstrained version, a 
very similar statement to Theorem |3] holds by taking ^J\og{L)), 

K = 0 (T"“+i /2 and 


Finally, we note that, as one would expect, the larger d* is, the better we 
should expect to perform. This is intuitive, given that when d* increases and the 
unit vectors u*, v* and w* are fixed, the standard Gaussian noise becomes small 
compared to the magnitude of the observations. 
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Figure 2: Latent vectors generating the structures for our examples. Each row 
gives rise to a different structure by taking the outer product on the correspond¬ 
ing vectors. 


5 Experiments 


Our experiments focus mainly on the task of rank-1 recovery, since all of our al¬ 
gorithms are based on the development of a rank-1 PTD. For all our simulations 
we use the Frobenius norm of the difference befween the estimated and true ten¬ 
sors as a measure of overall accuracy. The Frobenius norm is a natural choice 
of model fit, since we also benchmark against a recovery method that does not 
directly produce a rank-1 tensor but does provide an estimate of the true mean 
tensor. This method is based on the idea of stacking s e veral penalized matrix 
decompositions using the technique from IWitten et al.l (1200911 . Specifically, we 
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consider the tensor of observations X as a collection of 10 distinct 1000 x 400 
matrices, each of which is estimated via a rank-1 PMD. This will lead to 10 esti¬ 
mated rank-1 matrices which are concatenated to build a 10 x 1000 x 400 tensor. 

We call this procedure, with an abuse of notation, PMD(P^, P^) where and P^ 
are the penalties on v and w, when computing the rank-1 PMD matrices. 

The other methods included in the study are the PTD with different penal¬ 
ties Pu, Pv, Pw denoted as PTD(P^i, P^, P^). We consider choices such as the LI 
penalty, the fussed lasso (PL) and trend filte ring of order k (TFk). No te that we 
are implicitly comparing to the method from Anandkumar et al. ( 2014h since, for 
rank-1 recovery, this reduces to the power method, and hence to PTD(L1,L1,L1) 
for appropriate parameters. 

For our simulations, the tuning parameters by cross validation on a grid of 
possible values for each of the parameters A„, A^, and A^. In every experiment, 
we randomly select 10% of the data for testing, using the other 90% as training 
data. Out of a range of candidate tuning parameters we select those that pro¬ 
duce the smallest error on the 10% held-out set. This process is repeated for each 
of 100 simulations, for different methods and structures, in order to obtain aver¬ 
age Frobenius errors for all the competing methodologies with respect to every 
structure. 

To see how different choices of penalties can behave under different scenar¬ 
ios, we ran experiments using five different rank-1 tensors as the true mean ten¬ 
sor. These choices are designed to explore a range of plausible structures that we 
might find in real problems. For the first structure both v and w are piecewise 
flat. For the second, both v and w are periodic functions. For the third, both v 
and w are piecewise quadratic polynomials. For the fourth, v is smooth and w 
is piecewise constant. For the fifth, both v and w are sparse but with no spe¬ 
cific structural pattern like smoothness or flatness. The goal of this final scenario 
is to understand how structural penalties perform in a data set where they are 
not warranted. Further details of this simulation are included in the appendix. 

Figure |2] also shows a plot of these different structures. 

The results of our simulation study are shown in Table [H In all cases, PTD 
converged with few iterations, usually less than 10. From these results, it is clear 
that different choices of penalty are suitable for different problems. For structure 
1, in which the true v and w are piecewise flat, the combination PTD(L1, FT, FT) 
outperforms all the other choices that we considered. Interestingly, PTD(L1,TF1,FL) 
and PTD(L1, TF1,TF1) provided better results than the "stacking" method PMD(FL,FL). 
Note also that PTD(L1,TF1,FL) and PTD(L1,TF1,TF1) behave fairly similar to one 
another. This is reasonable since a piecewise constant function is a special case of 
a piecewise linear function and hence we would expect that TFl would produce 
only slightly worse results than fused lasso. 

Moreover, Table [T] also illustrates when our methodology should not be ex- 
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Table 1: Comparison of the Frobenius norm error between the true tensor and 
the estimated tensor using different methods. 


Method 

Structure 1 

Structure 2 

Structure 3 

Structure 4 

Structure 5 

PTD(L1,L1,L1) 

37.37 

47.63 

46.16 

39.91 

40.58 

PTD(L1,FL,FL) 

6.31 

27.54 

11.76 

10.30 

57.15 

PTD(L1,TF1,FL) 

15.07 

20.49 

11.55 

9.00 

70.32 

PTD(L1,TF1,TF1) 

17.61 

14.40 

11.85 

12.40 

79.25 

PMD(L1,L1) 

85.05 

89.10 

100.70 

91.89 

72.87 

PMD(L1,FL) 

49.09 

50.14 

52.70 

22.73 

92.20 

PMD(FL,FL) 

15.05 

43.17 

25.64 

33.95 

114.09 


Table 2: Comparison of the Frobenius norm error between the true tensor and 
the estimated tensor using for different levels of noise and a fixed structure, av¬ 
eraging over 100 Monte Carlo simulations 


Method 

cr = 1.25 

cr = 1.50 

cr = 1.75 

cr = 2.00 

cr = 2.25 

PTD(L1,L1,L1) 

62.66 

81.66 

80.46 

99.50 

94.37 

PTD(L1,FL,FL) 

32.61 

38.80 

41.63 

46.32 

49.33 

PTD(L1,TF1,FL) 

24.55 

28.55 

32.35 

37.87 

38.43 

PTD(L1,TF1,TF1) 

17.00 

21.35 

22.27 

27.09 

27.36 

PMD(L1,L1) 

116.19 

139.57 

158.71 

185.05 

209.45 

PMD(L1,FL) 

66.80 

76.81 

83.65 

98.18 

111.09 

PMD(FL,FL) 

52.43 

57.52 

65.36 

83.98 

92.71 


pected to work. This is what happens with structure 5, where there is no spatial 
pattern in the true vectors u, v and w, and instead they are merely sparse (80% of 
their coordinates are zero). Here, as expected, PTD(L1,L1,L1) outperforms any 
of our methods. 

In the previous experiment we simulated all data sets with the assumption 
that the noise had variance 1. Now we fix the rank-1 tensor mean of Structure 
2, where both v and w are periodic functions, and then we compare the per¬ 
formance of different methods as the standard deviation of the noise changes. 
Recalling that in Structure 2 both v and w are periodic smooth, it does not come 
as a surprise that PTD(L1,TF1,TF1) provides the best performance in all situa¬ 
tions considered in Table |2l In addition, it is clear that the error of all methods 
increases as the variance of the noise does. Nevertheless, the performance of our 
method seems to tbe the most stable. 

Finally, we evaluate the recovery of mean tensors having multiple factors, 
with cr = 1. Scenarios where the true model consists of J = 2 and J = 3 
are considered. Our comparisons are based on taking sums of different rank- 
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Table 3: Comparison of the Frobenius norm error between the true tensor and 
the estimated tensor using different methods, averaging over 100 Monte Carlo 
simulations 


Method Structures 



1,2 

1,3 

1,4 

2,3 

2,4 

3,4 

1,2,3 

1,2,4 

1,3,4 

2,3,4 

Anandkumar 

544.0 

310.5 

85.4 

121.0 

128.9 

273.4 

534.9 

555.3 

346.6 

350.3 

PTD(L1,FL,FL) 

55.3 

46.5 

27.1 

71.2 

59.7 

107.3 

184.2 

48.3 

102.8 

126.1 

PTD(L1,TF1,TF1) 

51.7 

71.6 

67.8 

49.2 

50.9 

94.0 

120.3 

75.2 

141.6 

120.8 


1 tensors using the structures discussed before. The compe ting methods are 
PTD(L 1,FL,FL) and PTD(L1,TF1,TF1), versus Algorithm 1 from l Anandkumar et al. 


(|2014ll . For the latter, we set the number of initializations L = 30 and the num¬ 
ber of iterations N = 10. The results in Table |3| show a clear gain for our ap¬ 
proach over the method from lAnandkumar et al.l (|2014h . which do not impose 
any smoothness constraints on its solutions. 


6 Real data examples 

6.1 Flu hospitalizations in Texas 

As a simple illustrative example, we consider measurements of flu activity and 
atmospheric conditions in Texas, see the appendix for information how to col¬ 
lect the data. There are 5 variables measured daily across 25 cities in Texas from 
January 1, 2003 to December 31, 2009. The variables are; maximum and daily 
average observed concentration of particulate matter (air quality measure), max¬ 
imum and minimum temperature, and a measure of flu intensity capturing flu- 
related hospitalizations per million people. The data tensor is thus a 5x25x2556 
array where we expect clear temporal patterns, along with correlations among 
the five variables. For example, during the winter months we would expect an 
increase in flu-related hospitalizations, correlated with seasonal patterns of max¬ 
imum and minimum daily temperatures. 

To show the kind of interesting results that one can get with our methods, we 
compute a two-factor Parafac decomposition. We use trend filtering of order 2 in 
the temporal mode and no penalty on the other two modes (although it would 
be straightforward to incorporate a penalty on the spatial mode as well.) We use 
our main result dl]) to find the factors using coordinate-wise optimization. The 
tuning parameter for the trend-filtering penalty is chosen by cross validation 
from a grid of values to ensure that we get a smooth vector for the time mode. 
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Time vector for first factor 


First factor loadings matrix 



Figure 3: (a) Time vector for the first factor (b) Loadings matrix for first factor (c) 
Time vector for second factor (d) Loadings matrix for second factor. 


We considered fitting models with different values of J, we found that a 
model with one factor explains 36% of the variance, a model with two factors 
explains %45 percent of the variance, and a model with three factors results in 
increase in variance explained of less than %1 with respect to the case J = 2. 
Moreover, the model with 3 factors results in highly correlated factors. For this 
reason we use a model with 2 factors. 

From Figure |3] we note a clear seasonal effect. In the first factor we observed 
that the loadings for the flu intensity, minimum temperature, and maximum 
temperature can be all explained in a similar way. For the first of these three 
variable the loadings are positive in every city. Hence, given the shape of the 
time vector we see a periodic pattern of flu cases across cities with the highest 
during the winter months and the lowest during the summer months. 

6.2 Motion capture data 

For a more challenging task, we evaluate the performance of our PTD method 
using data from the motion capture (moCap) repository at mocap . cs . emu . edu. 
This consists of subjects performing different physical activities in repeated in¬ 
dependent trials. We construct 3-array tensors by taking sets of videos as one 
mode, 12 representative variables of the body movements as the second mode. 


19 














and data frames in time as the third mode. The 12 variables are listed in the 
appendix. 

We built 2 tensors each for 5 different tasks, with each task generating a 
training-set tensor and a test-set tensor. The training set tensor corresponds to a 
single subject performing multiple repetitions of a single related set of physical 
activities. Similarly, the corresponding test-set tensor corresponds to that same 
subject performing further repetitions of those same activities. For example, the 
first data set (comprising 1 tensor in the training set and 1 tensor in the test set) 
is called 126-swimmtng; this is formed by looking at 8 videos of subject 126 per¬ 
forming different swimming styles. In the moCap repository, videos 1,3,6,8 are 
used for training while videos 2,4,7,9 are used for testing. This results in both 
tensors having dimensions 4x253x12. 

The other four data sets, explained in detail in the appendix, are 138-story 
(subject 138 walking and moving arms); 107-walking (subject 107 walking with 
obstacles); 9-ruimtng (subject 9 rurming); and 138-marchtng (just like it sounds). 
For these data sets, the tensors dimensions are 4 x 325 x 12,4 x 828 x 12,4 x 128 x 12, 
4 X 371 X 12 respectively. 

In this context, our PTD approach can be thought of as a smoothing step ap¬ 
plied to the training-set tensor, to yield better out-of-sample predictions for the 
test-set tensor. We evaluate the performance of the method by calculating the re¬ 
construction error (again, by Frobenius norm) when using the fitted/smoothed 
training-set tensor to predict the corresponding test-set tensor. 

We find that for the tensors considered here, rank-1 is the best Parafac de¬ 
composition, since models with higher factors result in strongly correlated fac¬ 
tors. We ran our rank-1 PTD with a trend-filtering penalty of order 2 on the 
second mode, and no constraints in the other modes. We compare against the 
PMD using the same degree of smoothness, as well as the classical PCA method 
from I Anaridhumar et al. ( 2014 1. From Table |4] it is clear that PTD offers the best 
performance. Thus we can see the gain of using smooth penalties, reflecting 
the fact that physical movements involve motion-capture variables that change 
smoothly in time. Moreover, it is clearly favorable to pool information across 
videos, as our method does, rather than treating them independently, as with 
the PMD algorithm. 


7 Discussion 

In many problems, tensors offer a natural way to represent high-dimensional, 
multiway data sets. However, tensors by themselves are difficult to interpret, 
creating the need for methods that shrink towards some simpler, low-dimensional 
structure. 
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Table 4: Comparison of the Frobenius norm error between the estimated tensor 
and the test tensor tor the moCap datasets 


Method 

126-swimmmg 

138-story 

Task 
107-walking 

9-runnmg 

138-marching 

Anandkumar 

254.80 

134.63 

135.17 

84.40 

143.86 

PTD(L1,TF2,TF2) 

250.98 

131.78 

134.92 

84.29 

142.44 

PMD(L1,TF2,TF2) 

267.89 

145.14 

143.43 

88.06 

149.41 


Parafac models have been widely used for this task, but existing state-of-the- 
art methods typically constrain the factors to be orthogonal, or simply do not 
enforce any constraints. As we have shown, this can be undesirable in practice, 
especially if one is looking for more interpretable factors, where there is a natural 
spatial or temporal relation between observation, and it is expected that the fac¬ 
tors will be smooth. We fill this gap by providing a set of methods that precisely 
offer piecewise smooth Parafac decompositions. Our methods exploit state of 
the art convex optimization algorithms and are shown to have excellent perfor¬ 
mance in our experiments. We leave for future work the study of algorithms for 
more general classes of penalties that can potentially be non-convex. 

Finally, we have shown two alternatives for finding our smooth tensor de¬ 
compositions with generalized lasso penalties. The constrained formulation seems 
to be an attractive option for practitioners, with clear intuitive control over the 
level of smoothness exhibited by the solutions. On the other hand, in light of its 
computational advantages, the unconstrained formulation offers a more practi¬ 
cal approach, especially if there is no pre-existing knowledge about the antici¬ 
pated smoothness of the solutions. 
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A ADMM algorithm to solve the constrained updates 


In this section we discuss how to find the updat es for Algo rithm 1 from the 
main document using the ADMM algorithm from Zhu ( 2ni5h . Since these are 
symmetric we focus on the particular update m'". In this case the problem is 


m”* = arg min {—Y_X 2 V^ ^ X 3 

U 

subject to ||m ||2 < 1 , ||^||i < c„, 

z {Eu - = z. 

We define y as 


(17) 


y = Yx2V^-^ 

and solve (ITTb , using the ADMM algorithm from izhul (|2015ll , by considering the 
iterative updates 


Uk+i = aigmin {l\\y - u\\l + 2 {ak - ak-i)'^ u + - Uk)'^ Eu {u - Uk)} 

ll'^lli —1 

_ {ak-ak-l)+^Eu ut 

|||-(D“)^ {ak-ak-i)+^EuUk\\2 

Zk+1 = argmin{||D“Mfe+i+ - 2;|||} 

11 111 ^ 

Ufc+I = ak + p (D“ Uk+i - Zk+i) 


where the update for Zk+i can be done using the algorithm from iDuchi et al. 

6^. 


As explained in the main manuscript, in practice, using update as part of 
an ADMM algorithm leads to difficulty enforcing the ii constraint in reasonable 
runtimes, and results in larger reconstruction error than the technique we have 
recommended. 


B Proof of technical results 

B.l Proof of Theorem H] 

Note that the Lagrange dual function of the original problem is given by 

L (A, /i) = minimize \ 

= minimize [ 

u ^ 

A, /i > 0. 


+A(||Dm||i- cs)+/i(||M||2 - l)] 
—x^u + A||Dm||i + /illwll^] — P — ^cs 
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Next, define for fixed A, /r > 0, the function gx^^\ —)■ M given by 


9\^l (m) = -x'^u + A||L>m||i + /i||M| 


2 - 


From (|T8|) we need to solve the following problem: 


minimize gx,^ {u) , 

U 

which can be rewriten as 


(18) 

(19) 


minimize \—x^u + A|l2;||i + /i||M||2l 
s.t z = Du . 

This problem has the following Lagrangian: 

Lx,fi {z, u, 7) = -x^u + All^lli + gi\\u\\l + 7"^ {Du - z) , 

which is nicely separable in u and z. 

Let us now consider some special cases of p and A. First, if A = 0 and p = 0, 
then clearly. 


min La,;, ( 2 ;, M, 7 ) =-00 V 7 , 

Z,U 

Second, if A = 0 and p > 0, then 


min \—x'^u + A||iAM||i + pIImIIoI = ——x^x. 

u 4 p 

Next, if A > 0 and p = 0, then 


min La,;, {z, m, 7 ) = —00 V 7 with D'^'y 7 ^ x, 

Z^U ’ 

and 


min La,;, ( 2 :, M, 7 ) = 0 V 7 with D'^'y = x and II 7 II 00 < A. 

Z,U ’ 

Thus 


min \—x'^u + A||iAM||il 

L J 


—00 if X ^ Range (D^) 

0 if there exist 7 with D'^-y = x and || 7 ||oo < A. 


Finally, let us now focus on p > 0 or A > 0. Then 
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mm 


in [—x'^u + /i||M ||2 + 7^-Dm] = — — \\x — D 


T , ||2 

2i 


while (see Tibshirani (|2011 h 


mm 

Z 


in [A||z||i - -f'^z] = 


0 if 


<A, 


-oo otherwise. 


Hence, the dual problem to (HU) is equivalent to 


\x — 

subject to ||7lloo<A. 


mmimize — \\j. — i-y ^\2 

7 4p 


But for /i > 0 fixed, this is equivalent to solving the problem 


minimize -llx — iA 


T^,\\2 


S.t 


< A, 


( 20 ) 


which can be so lved for every A > 0 using the solution path algorithm from 
Tibshirani ( 2011 ). 

Let us denote by 7 a the solution to (l20l) for a fixed A. Therefore, 

= -^||x - T>'^7 a|| 2 -/i - Acs, 

4/i 

which implies that the dual to the original problem becomes 


maximize 




- D'^^xWl - 1^ - >^C 


( 21 ) 


Finally, recall from Boyd and Vandenberghd (120041) that any u* solution to the 
original problem must also solve 


u 


= arg min [—x'^u + A*||iAM||i + /i*||m|| 2 ] , 


for A* and p* that are optimal for ((2T1) . However, the objective function in (ITU) is 
strictly convex since /i* > 0, and so its solution u* is unique and also solves 
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minimize \—x'^u + A*|| 2 :||i + /i*||M|| 2 l 

subejct to 2 : = Du. 


The KKT optimality conditions for this problem imply that 

for some a subgradient of the function z ^ \\z\\i at z* = D u*. Therefore 

* (x - T>'^7 a*) 

u = j- nr- II ’ 

\\x - -D^7a*||2 

and the result follows. 


□ 


B.2 Proof of Theorem 2 

Here we assume that data is generated as 


Y_= d* u* o V* o* w + e 


and 


^ * 
V — V 



w — w* 


1 



Under these conditions we show that u defined as 


satisfies 


u = arg min 
subject to 


— UX2'0 X3W 

| m ||2 < 1 , < c „ 


P 



< 


ct 

d*VL 


+ 


2cu 

d* 


2 (w*, v) {w*, w) 




> ^ exp (-tV2) 


1 

L3/2yi^ 



The proof will then follow by an application of this claim after each block update, 
and applying the identity for the intersection of such events (see the main paper). 
To prove the claim above, we start by noticing that 
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u= arg min — (d*) ^Y_X 2 VX‘iW 

nSM® 

subject to ||m ||2 < 1, < c„. 

Next we use the notation R for the row space of D: R = row{D) and R^ = 
null{D). Moreover, Py denotes the perpendicular projection onto the space V. 

Hence, by suboptimality, 

IIIm —M*||2 < 1 — u'^u* + (Y X 2 V X 3 w)'^ (u — u*) 

= 1-u'^u* + Jr ((d*u* ov* O* W + e) X 2 V Xswf (Pji + Pji±) (u - u*) 

= 1 - it'^u* + {v, V*) {w, w*) {u*Y {u-u*) + j^eX 2 V X 3 w{Pr + Pr±) {it - u*). 

( 22 ) 

Let us now bound the terms in the expression above. First, let ai,..., a/c be 
an orthonormal basis of R^. Then 

JrEj=t^(e X2'0 (m-M*) < 

< 

< 

for some constant c with probability at least 

Here we have used Mill's inequality and the fact that we can take ||aj||oo = 

0{L~^/‘^). The latter claim is immediate for ku = 0. If = 1 it can be proven as 
follows. First, we set ao = (1/\/L, ..., l/\/Z)^ G M^. Then by the definition of 
P-*“, an induction argument shows that 


1 s^k„+l 
d* 

1 c 

d* y/L 2.^i=l 
1 c(ku+l)t 
d* y/L 


e X2V X3 wY' ctj\ ||aj||oo 2 \/2 
(e X2V X3 wY' ctj\ 


ayk+i = k ai ,2 - {k- l)ai,i (24) 

for k G {2,..., L — 1}, where aij is the j—th coordinate of Oi. But since Oi is a unit 
vector, simple algebra yields 


1 = 


(/, _ 1)2 {L-2)(L-1)(2{L-2)+1) 

-2 


® 1,2 + 


(L-2)(L-l)(2(L-2)+l) I (L-2)(L-1) 
6 “*“2 


(L-2)(L-l)(2(L-2)+l) 

6 


1,1 


( 25 ) 


Ctl,2- 

Now, since ai and oq are orthogonal, we must have 
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0 


^1,2 — 


- 1 aiA. 


(26) 


L{L-1) 

2 


(L-l)(L-2) 


Then the fact that ||ai||oo = follows from (l24l) , ([25b and (l26b . 

Next we bound the term involving the projection operator onto the space R 
in (|22ll . By Holder's inequality. 


4r (e XoV Pr (u -U*) < -jr 


— d* 


(e X 2 -0 Xg wf 


and hence, as in Corollary 4 from lWang et al.! (|2014jl . we find that 


P 


—e X2 V x^wPr {it 
a* 


d* ) - L^Py/hgL 


On the other hand, by the Cauchy-Schwarz inequality, we have 



1 - it'^u* + {v, V*) {w, w*) {it - u*) = (1 - {v, V*) {w, w*)) (||m *||2 - {u, u*)) 

< (1 - (w,w*)) ||m - 

(28) 

Combining (|22ll . (l23l) . (|27|) . (l28l) , and proceeding in similar fashion for the other 
updates, the identity 


F{AnBnC)= P(H)P(5 I A)F{C \AAB) 


for any events A,B and C implies the result. 

For the case of multiple factors, we have the following result. Suppose that 
the data is generated as 


i=i 

where E is tensor of white noise. Suppose that we have current parameters 
estimates of {u*}jyLjg, {v*}j, {w*}j, {d*}j which we denote by {vj}j, 

Let us now provide an error bound for the estimate of u*^ given all the other 
estimates. To that end, define 


29 
















Ujg 


arg mm - 


u -1 n X2 %o ><3 Wjo -y^A. 

j¥=jo 

subject to \\D “m||i < Cu 


jyVj,) Vj[Wj,) WjUj 


u^u = 1, 


and assume that < c„ and 


l^'io - Vj, II2 < iFio - Wjo II2 < 


This leads to the following lemma. 
Lemma 4. Under the definitions just given, 


V2 




if ct 2c„L^“+i/2 


+ 


d 


Jo 


+ U\ > 1- 


2 exp {—t'^/2) 


TT t 


L^/^y/\ogL\ Stt’ 



where 


U = 


d 


+ ^*j (^i ■ '^Jo) (K • *io) 


2 ° i^io 


Proof. To show this we proceed as follows. We start noticing that, by sub-optimality. 


— 1 ) ■ — ?/* 

2 “jo “jo 2 — d 


SjVjo (“^J • %o) (^i • ^io) W + d* {v* ■ Vjf) {vjJ ■ Wi. I u: 


) M*) +Mjo) 


d* '^J \^J ^JoJ \'^J '^JoJ '^J I '^j \^j ^Jo) K'^j '^Joj “jyj V “jo ' “JoJ 

+ 1 - %o • “jo + ^ X2 %o X 3 Wj, + d*^ {v*^ ■ Vj,) {wl ■ Wj,) {-u*^ + 


< 


^ EjVjo [-dj {Vj • %o) (^i ■ ^io) W + d* {v* • %o) (< ■ ^Jo) ^ 

+ 1 - %o • “jo + ^ X2 %o X 3 Wj, + (t;*^ • Vj,) {wl ■ Wj,) ul] {-u*^ + 


u 


10 ) 


and hence the claim follows for the case J = 1. 


□ 
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C Discussion and extensions 


C.l Further connections with existing work 


In recent years, many different efforts have been made to apply the ideas of 
sparse regression and sparse matrix decomposition to the context of higher- 
order tensors. Our paper has shown that structured penalties from the generalized- 
lasso class can offer significant modeling benefits when the underlying factors 
are piecewise constant or smooth. Moreover, our main result shows that the 
factors can be efficiently computed by a coordinate-wise optimization routine, 
exploiting results on the solution path of the dual problem for the generalized 
lasso. Both the simulated and real examples have shown the power of the ap¬ 
proach. 

Our general framework has applications across a wide class of problem for¬ 
mulations for analyzing multi-way data. In this section, we describe some con¬ 
nections with other existing methods. We also describe how orthogonality con¬ 
straints can be imposed in our approach. 

Recall that in the usual PC A framework we are given samples xi,... ,Xm £ 
M'^, and the task is to find a unit vector a e such that the points xja,..., x'^a, 
on the real line have the largest possible variance. The problem can be stated in 
matrix notion as 

maximize a^X'^Xa. 

||a||2=l 


By imposing LI constraints, the authors of llolliffe et all ( 2003 ) propose to sac¬ 
rifice the variance explained in order to gain interpretability The resulting prob¬ 
lem, called SCoTLASS, is 


maximize a^X'^Xa subejct to 


< 1, ||a||i < c. 


More generally, the authors of Lu et al.l ( 2008|) consider Multilinear Principal 
Component Analysis of Tensor Objects (MPCA). This is defined for a set of ten¬ 
sors Ai, A 2 ,..., Am e i^y solution to the following problem: 

{U^,m = 1,..M} = argmax T;,, 

{UT,m=l,..M} 


where = Am Xi t/f X 2 ■ ■ ■ Xm Um, m = 1,..., M and T;, is their total scatter. 
The motivation is to perform feature extraction by determining a multilinear 
projection that captures most of the original tensorial input variation. The fitting 
algorithm proceeds by iteratively decomposing the original problem to a series 
of multiple projection sub-problems. 

Combining the regularization idea of SCoTLASS with MPCA, we can formu- 
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late the penalized MPCA problem as 


maximize } I (A — 

k=\ 

subject to Ps{u) < cs, Pt{v) < ct, \\u\\l, \\v\\l < 1, 

where X is the sample mean of the training data Xi, X 2 ,..., Xk- The solution 
to this problem allows us to project the training data into a lower-dimensional 
space in a way that maximizes the variance explained while retaining structural 
constraints in the projection space. The key point is that we can use the rank-1 
PTD algorithm to solve (l29l) , since it can be verified that (l29l) is equivalent to 


) X1UX2V 


(29) 


maximize Yxiux 2 vx^w 

u,v,w 

subject to Ps (u) < cs Pt (v) < ct 

\\u\\l < 1, Hulls < 1, ll^lls < 1, 


where F G is a tensor satisfying that ^ = (Xj,) This coim ection is, 

in fact, analogous to the coimection established in IWitten et all (|2009ll between 
SCoTLASS and the PMD algorithm. 


C.2 Orthogonal factors 

We now return to the multiple-factor decomposition proposed in the main pa¬ 
per. Given and input data tensor Y, we seek to find a decomposition as the sum 
of k rank-1 tensors, as in the Parafac model. We proposed an algorithm to find 
such a representation based on our algorithm for rank-1 PTD, but there were 
no constraints regarding the orthogonality of the vectors involved in the repre¬ 
sentation. Orthogonality is a natural constrai nt in factor-typ e models, and it is 
often i mposed in tensor decompo sitions; see Cichocki ( 2013 1: Kolda and Badeil 
( 2009 1: De Lathauwer et al.l (i2000ll . In the framework of matrix decomposition, 
the authors of lWitten et al.l (|2009tl explored an approach to obtain multiple rank- 
1 factors that were sparse and whose vectors were unlikely to be correlated. 
However, no formal guarantee was provided that the output vectors would be 
orthogonal. Here we fill that gap and provide a simple method for finding fac¬ 
tors whose vectors are orthogonal and satisfy structural constraints, including 
sparsity. 

Suppose that we are given k rank-1 tensors that approximate F. At the A: -|-1 
step, we try to find a rank-1 tensor that best approximates the current tensor 
of residuals. This is done by solving an optimization problem whose objective 
function is the Frobenius norm of the residual, with structural constraints spec- 
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ified by the chosen penalties. If we also impose the additional constraint of or¬ 
thogonality, then the update for u^+i can be written as the solution of a problem 
of the form 


mmimize u x 


s.t 


■u 


2 — 


< 1, ||£*m||i < c, Uj = 0 Vj = 1,fc — 1 


We can further rewrite this as 


minimize 9^ x 

u 


s.t 


l2 — 


< 1 , 


\\D0\U<c, 


(30) 


where the matrix D equals the product of D and a matrix whose columns form 
a basis of the orth ogonal complement of the space spaimed by see 

Witten et al.l ( 2009h . Hence, we can use our rank-1 PTD algorithm to find sparse 
orthogonal Parafac decompositions. 

The ortho gonality constraint imposes ad ditional computational burdens. As 
the authors of Arnold and Tib shir anil (|2015h point out, problems of the form (130)) 
can be solved efficiently if the matrix D is sparse. This can happen if the vectors 
are ui, U 2 ,.-,Uk-i are highly sparse. If, on the other hand, D is not sparse, then 


(l30b can be solve v ia its dual, using a projected-Newton method similar to the 
recent algorithm in Wang et al. ( 2014h . 


C.3 Multilinear regression 


Here we show how som e of the bas i c idea s in multilinear regression are related 
to our methodology. See lZhao et al.l (|2013ll for a discussion of multiline ar regres- 
sion. A more general approach for tensor regression is discussed in ICichocki 


d2ni3h . 


Motivated by the statistical settin g in iBaneriee et all (1200411 . and by the dis¬ 
cussion of tensor regression given in ICichockil (|2013ll , we consider the problem 
of finding single-factor representations of A e and F G such that 


J^^gpoqoa, g,dE'M.,pE 


q e 


CL G 


(31) 


The intuition behind dH) corresponds to a problem in which, for every time 
point t and location s, there exists an observation ys,t and a vector of covariates 
Xs,t,-. Hence it is natural to impose the constraint that the one-factor representa¬ 
tions of X and Y have common vectors associated with time and location. The 
difficulty of this problem lies in the fact that we need to simultaneously approx¬ 
imate X and Y by the representations in (ISTl) . Below we formally state a version 
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of this problem, incorporating some additional constraints that are merely tor 
identifiability purposes. 


minimize \\X — q p o q o a\\% + \\Y — dp o qW'i, 
peR^,geKr,aeK-^,3,deK 

subject to Ps{.p) < cs Priq) < ct Pj (a) < cj 

Ml < 1 ||g||^ < 1 ||a||^ < 1. 


Clearly the objective function in (1^ is a quadratic form for each of p, q, and 
a individually, while holding the other terms fixed. This can make the solving 
the problem complicated. Alternatively, we can try to m aximize the produ ct of 
the terms -<A,pogoa>- and -<Y_,poqy,as observed by lZhao et al.l (|2013h . But 
we notice the following elementary inequality: 


‘2{Y,poq){X,poqoa) < «A,Pogoa)+(y.pog))^ 

< {K.:P° q° cbY Y ° qY ■ 


Hence, it makes sense to solve the problem 


minimize (X, p o qo a) + -<Y_,poq) 

,g,d^R. 

subject to Ps (p) < Cs Pt (q) < ct 

Ml < 1 Ikll2 < 1 l|a|l 


Pj{a) < cj 
< 1 


(32) 


which has an trilinear obective function in (p, q, a) Thus, we can try to solve 
(l32l) by using coordinate wise optimization, taking advantage of our previous 
developments. 

Although we do not include simulations for problem (l32ll in our experiments 
section, our investigations suggest that combining the information of both the 
predictors X and the response Y_ can provide better results than just fitting a 
PTD on Y_ and a PMD on X separately. 


C.4 Extensions to Tucker models 


Up until now we have being interested in Parafac models, which are special 
cases of genera l Tucker model. A penalized Tucker model was proposed in 


Cichockil (120131) in which the goal is to maximize with respect to t/(n) e 


P ^ dn 


n = 1,... ,N the cost function 


Df {Y\\G, {U}) = IIU - G X {[/} III + J2^nCn , 

n 
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with penalties Ci,..., on ..., respectively and positive parameters 

OC\^ . . . ^ OCfi' 

We provide some insight on a penalized Tucker problem with generalized- 
lasso penalties on the columns of each For simplicity of nofation, we as¬ 
sume N = 3, Jn = 2, and n = {1,2,3}. Our formulation of fhe problem becomes 


minimize IIH - ° ° “SII^ 

subjecffo < Cn V?7,e{l,2, 3}, je{l,2} 

\\u^^\\l = l Wn E ,N} JE {1,2}. 


(33) 


This can be rewritten as an optimization problem whose objective function is is 
linear on each when the other variables are fixed, and convex on each 
when every other variable is fixed. Hence, we can use an algorithm similar to 
our rank-1 PTD procedure based on coordinate wise optimization. 

There is yet a different way to think about Tucker models. In this class of 
problems fhe core fensor is considered random, and the interest lies in recon¬ 
structing the matrices f/(n), 

n = 1,..., N, which are assumed to be invertible. 
The model is written as Y_ = Z x {U } whe re Z is an array of independenf 
sfandard normal entries; see Hoff et al. ( 201lh . There, the authors proved that 
cov {Y_) = El o S2 o • • • o Sat, with The matrices introduce 

covariance structure to the model. 

Given samples Y_i, ■ ■ ■ ,Yn we would like to estimate Ei,..., E„. Hence we form 
the following problem: 


maximize log P [Y^, ...,Yn \ Ei,..., E„) - 2 ^a„p(e„) , 

Sri 


where the constr aint is the s et of non-negative definite matrices. This formula¬ 
tion appeared in IHoff et al.l (|201lh . but without the penalties . Sir nilar formula¬ 
tions including penalties can be found in Leng and Tangl ( 2012 ) and Friedman et al. 
( 2010h . In fact, a coordi nate descent type of algorithm can be used that is similar 
to the one proposed i n IHoff et al.l (l201lh. bu t tha t solves every subprob lem with 
methods described in Leng and Tang ( 2012h and Friedman et al. ( 2010 1. 


D Simulation details 

In our set of experimenfs we considered 5 differenf hidden rank-1 tensors con¬ 
structed as u o V o w where the vectors u, v and w are described below. The 
notation indicates that components i through j of fhe vecfor are all equal fo 
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the value x. 


Structure 1 

. M = {1,1,1,-1,-1,-1,0,0,0,0}. 


Structure 2 

• M = {0, 0, 0, —1, —1, —1, 0, 0, 0, 0}. 

• V = with Vi = cos ^12 71 j for i = 1, 2,..., 1000. 

• w = with Wi = cos ^9 tt j for i = 1, 2,..., 400. 


Structure 3 

. M = {0,0,0, 0,-1,-1,1,1,1,1}. 


• with Wi = -0.7) + (^) fori = 1,2, ..., 1000 . 

• Define wl = ^ for i = 1,..., 400. Then, set Wi = w[ (0.05 — w') for i = 
1, ..., 200 and Wi = {w[Y for i = 201,..., 400. 


Structure 4 

• u = {0,0, 0,0, 0,1,1,1,1,1} 

• Define w' = |^ for i = 1,..., 1000. Then, 
Vi = cos(7r w') + .65. 

. ^ = {o}^°Mi}l^?,{o}?°?,{i}i?,{o}«?. 


Structure 5 

. w = {-l,-l,0,0,l,l, 1,-1,-1,-1}. 

• V has 80% of its entries equal to zero and the remanining 20% are random 
numbers drawn from a standar normal distribution. 

• w has 92.5% of its entries equal to zero and the remanining 7.5% are random 
numbers drawn from a standar normal distribution. 
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E Real data examples additional details 

E.l Flu hospitalizations 

Our flu example uses aggregate, non-identiflable hospitalization records from 
each of the eight largest counties in Texas from January 1, 2003 fo December 30, 

2009. Our dafa-use agreement does not permit dissemination of these hospital 
records. We also use data on temperature and air quality (particulate matter) in 
these counties, which can be obtained directly from CDC Wonder (http://wonder.cdc.gov/). 

E.2 Motion capture 

To construct the tensors involved in the five fask considered, we use fhe vari¬ 
ables: the second coordinate for roof (variable 2), the first coordinate for upper- 
back (variable 10), the first coordinate for uppemeck (variable 19), the first coor¬ 
dinate for head (variable 22), the second coordinate for rhumerus (variable 28), 
rradius (variable 30), the second coordinate for Ihumerus (variable 40), Iradius 
(variable 42), the second coordinate for lhand (variable 44), Ifingers (variable 45), 
rtibia (variable 52), Ifibia (variable 59). 

For task 138-story we use videos corresponding to subject 138 in the moCap 
repository. Videos 11-14 are used to construct the training tensor while 15-18 are 
used to build the test tensor. 

To build task 107 walking we use videos from subjecf 107. For training we 
use videos 1-4 for training while videos 5-8 are used for testing. 

For task 09-run we use videos corresponding to subject 9. Videos 1-4 are used 
for training, and videos 5-8 are used for fesfing. 

To consfruct fask 138 marching we fake videos from subjecf 138. For training 
we use videos 1-4 for training while videos 5-8 are used for fesfing. 

Finally, for task 126, the training set is built using videos 1,3,6,8 while the test 
set uses videos 2,4,7,9. 
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